Modeling count data that has been collected over different levels of exposure using an offset.
General Principles
When we want to model count data, where the counts are observed over different periods or areas of exposure, we use a Poisson model with an offset. This is a type of generalized linear model used for modeling count data and contingency tables.
An offset is a predictor variable with a coefficient that is fixed at 1. It is used to account for the “exposure” variable, which represents the opportunity for an event to occur. For instance, if we are counting the number of sick individuals in different cities, the population of each city would be the exposure variable. A city with a larger population is expected to have more sick individuals. The offset accounts for this by essentially modeling the rate of events per unit of exposure.
Considerations
Note
The dependent variable in a Poisson regression must be a non-negative count.
The exposure variable used as an offset cannot contain zeros.
A key assumption of the Poisson distribution is that the mean and variance of the count variable are equal. If the variance is greater than the mean, a condition known as overdispersion, a Negative Binomial regression might be more appropriate.
The logarithm of the exposure variable is typically used as the offset. This is because Poisson regression models the logarithm of the expected count. By including the log of the exposure as an offset, we are effectively modeling the rate.
Example
Below is an example of code that demonstrates a Bayesian Poisson regression with an offset. The data consists of the number of elephant aggressions (agressions), the age of the elephants (age), and the number of years they have been observed (years_obs). The goal is to model the rate of aggressions per year, accounting for the age of the elephants.
from BayesForge import bfimport jax.numpy as jnpimport numpy as np# Setup device------------------------------------------------m = bf(platform='cpu')# Simulated data ------------------------------------------------# Using numpy for simulation to easily set the true valuesnp.random.seed(42)N =100population = np.random.normal(0, 1, size=N)cid = np.random.binomial(1, 0.5, size=N)hours = np.random.uniform(1, 10, size=N)true_a = np.array([2.5, 3.5])true_b = np.array([-0.1, 0.2])l = hours * np.exp(true_a[cid] + true_b[cid] * population)total_tools = np.random.poisson(l)print(f"True a: {true_a}")print(f"True b: {true_b}")# Model data ------------------------------------------------def model_offset(cid, population, hours, total_tools): a = m.dist.normal(3, 0.5, shape=(2,), name='a') b = m.dist.normal(0, 0.2, shape=(2,), name='b') l = hours * jnp.exp(a[cid] + b[cid]*population) m.dist.poisson(l, obs=total_tools)m.data_on_model =dict(cid=cid, population=population, hours=hours, total_tools=total_tools)# Run sampler ------------------------------------------------m.fit(model_offset, progress_bar=False)# Diagnostic ------------------------------------------------summ = m.summary()print("\nSummary results:")print(summ[['mean']])
/home/sosa/work/3.12venv/lib/python3.10/site-packages/tqdm/auto.py:21: TqdmWarning:
IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html